Introduction

The purpose of this project is to take a player’s individual statistics and attempt to determine whether that player won or lost the game. I will then use this predictive capability to try to guess the winner of the largest tournament of the year, Call of Duty Champs.


What is the Call of Duty World League?

Call of Duty is a first-person shooter franchise that began in 2003. Since then, it has become one of the largest multiplayer video game franchises in existence, and a competitive scene for the game has gained traction. In 2016, the Call of Duty World League was born – a sponsored league that hosts major tournaments throughout the year for the best players in the world. At these events, pros play three different game modes to decide the winner of a series: Hardpoint, Search and Destroy, and a third game mode that often changes yearly. For the data we are covering, the third game mode is Control. Every team in the league consists of 5 players, and each series is a best of 5.

embed_url("https://www.youtube.com/watch?v=VQC0aZuGBFs&t=2740s")

Game Modes in the Circuit

Hardpoint

In Hardpoint, the two teams fight over a location on the map called the “hardpoint”; for every second a team spends holding it, that team gains one point. If both teams are in the hardpoint at the same time, neither team collects points. Every sixty seconds, the hardpoint changes locations on the map, so teams must make tactical decisions to rotate across the map. The first team to 250 points wins the map.

embed_url("https://www.youtube.com/watch?v=VQC0aZuGBFs&t=2740s") %>%
  use_start_time(6*60 + 35)

Search and Destroy

In Search and Destroy, the two teams play rounds in which each player has only one life; if you die, you are out until the next round. A team wins a round by killing the entire other team before the time limit; the offense can also win by planting the bomb and having it detonate (45 seconds after the plant). The first team to win 6 rounds wins the map.

embed_url("https://www.youtube.com/watch?v=VQC0aZuGBFs&t=2740s") %>%
  use_start_time(18*60 + 39)

Control

In Control, there is an offense team and a defense team, and the teams switch between offense and defense each round. Each team has 30 lives per round. The offensive team tries to either capture two points on the map or eliminate all 30 of the other team’s lives; the defensive team tries to either defend the two points until time runs out or eliminate all 30 of the other team’s lives. The first team to win three rounds wins the map.

embed_url("https://www.youtube.com/watch?v=VQC0aZuGBFs&t=2740s") %>%
  use_start_time(45*60 + 40)

Why is this model useful?

This model is useful because it will allow us to see whether a player’s statistics may have contributed to a win or not. As a fan of competitive COD, I have seen a lot of debate about statistics and their importance, so I wanted to look directly at the impact of a player’s statistics.


Load Packages

All the packages are loaded below.
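The package chunk itself is not shown here, so the list below is a sketch inferred from the function calls used throughout this report; it may be incomplete (for example, the package providing `remove_cols()` is not identifiable from the text).

```r
# packages inferred from the calls used throughout this report
library(tidyverse)    # read_csv, dplyr verbs, pivot_wider, ggplot2, stringr, tibble
library(tidymodels)   # recipes, rsample, parsnip, tune, workflows, yardstick
library(janitor)      # clean_names()
library(sqldf)        # SQL-style subsetting of data frames
library(data.table)   # rowid()
library(rpart.plot)   # decision tree visualization
library(kknn)         # engine for nearest_neighbor()
library(ranger)       # engine for rand_forest()
```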


Load Data

This project makes use of official CWL data that is uploaded on GitHub. All data is organized relatively cleanly and all missing data is reported.

proleague2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-07-05-proleague.csv"))
fortworth2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-03-17-fortworth.csv"))
london2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-05-05-london.csv"))
anaheim2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-06-16-anaheim.csv"))
proleagueFinals2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-07-21-proleague-finals.csv"))

# all stats for all major tournaments (EXCEPT CHAMPS) in BO4 (2019)
majors2019 <- rbind(proleague2019, fortworth2019, london2019, anaheim2019, proleagueFinals2019)

# champs will act as our test data; we will try and predict the winner
champs2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-08-18-champs.csv"))

What data are we concerned with?

In order to determine a win for a game, we will need to address Hardpoint, Search and Destroy, and Control separately. Each of these game modes has different parameters, so we will have to fit a model for each game mode.

Hardpoint:
match_id — helpful for getting rid of missing data
win — ‘1’ for a win and ‘0’ for a loss
team — player’s corresponding team
player — which player the data corresponds to
mode — game mode
k_d — kill/death ratio; used to show overall impact on the map
assists — in addition to k/d, assists show overall support on the map
role — a role is determined for each player depending on their most common gun throughout the year
damage_dealt — total damage done in the map
player_spm — score per minute
x_2 — number of two-pieces (two kills in quick succession)
x_3 — number of three-pieces (three kills in quick succession)
x_4 — number of four-pieces (four kills in quick succession)
hill_time_s — hill time measured in seconds
hill_captures — shows activity on the map
hill_defends — shows activity on the map

Search and Destroy:
match_id — helpful for getting rid of missing data
win — ‘1’ for a win and ‘0’ for a loss
team — player’s corresponding team
player — which player the data corresponds to
mode — game mode
k_d — kill/death ratio; used to show overall impact on the map
assists — in addition to k/d, assists show overall support on the map
role — a role is determined for each player depending on their most common gun throughout the year
damage_dealt — total damage done in the map
player_spm — score per minute
x_2 — number of two-pieces (two kills in quick succession)
x_3 — number of three-pieces (three kills in quick succession)
x_4 — number of four-pieces (four kills in quick succession)
fb_round_ratio — ‘snd_firstbloods’/‘snd_rounds’
bomb_sneak_defuses — sneak defuses are often in pivotal rounds
bomb_plants — good indicator of role
bomb_defuses — good indicator of role

Control:
match_id — helpful for getting rid of missing data
win — ‘1’ for a win and ‘0’ for a loss
team — player’s corresponding team
player — which player the data corresponds to
mode — game mode
k_d — kill/death ratio; used to show overall impact on the map
assists — in addition to k/d, assists show overall support on the map
role — a role is determined for each player depending on their most common gun throughout the year
damage_dealt — total damage done in the map
player_spm — score per minute
x_2 — number of two-pieces (two kills in quick succession)
x_3 — number of three-pieces (three kills in quick succession)
x_4 — number of four-pieces (four kills in quick succession)
ctrl_firstbloods — first kill in a round of control
ctrl_firstdeaths — first death in a round of control
ctrl_captures — number of captures in a control game


Data Split


Data Cleaning and Organization

The data below covers all of the majors throughout the season, except for COD Champs, which we reserve to act as a test set. The raw data from each major is merged into one major dataset, which is then broken up into Hardpoint, SND, and Control datasets.

All Majors 2019 data

# CLEANING
majors2019 <- majors2019 %>% clean_names(.)

# new dataset that contains all of the missing data, just in case
majors2019_missing <- sqldf('SELECT * FROM majors2019 WHERE match_id LIKE "missing%"')

# whole event data, all players and all maps, where player names are organized alphabetically
majors2019 <- majors2019[order(majors2019$player),]

# removes missing values
majors2019 <- sqldf('SELECT * FROM majors2019 WHERE match_id NOT LIKE "missing%"')

# calculates all the players that have played more than 50 games
player_numgames <- count(majors2019, player) %>% subset(., n > 50) %>% remove_cols(n)

# includes all existing data for all players that have played more than 50 games (arbitrary number)
majors2019 <- sqldf('SELECT * FROM majors2019 WHERE player IN player_numgames')

# removes all matches where damage = 0; almost always occurs as a result of data loss
majors2019 <- sqldf('SELECT * FROM majors2019 WHERE damage_dealt != "0"')
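The two sqldf filters above (minimum games played, nonzero damage) can also be written with dplyr verbs alone; a minimal sketch on toy data, where the threshold is lowered from the report’s arbitrary 50 games to 2 so the example is small:

```r
library(dplyr)

# toy stand-in for majors2019: three rows for player "a", one for player "b"
toy <- tibble::tibble(
  player       = c("a", "a", "a", "b"),
  damage_dealt = c(3891, 0, 4480, 4515))

# keep players with more than 2 toy games (the report uses n() > 50),
# then drop rows where damage is 0 (almost always data loss)
toy_filtered <- toy %>%
  group_by(player) %>%
  filter(n() > 2) %>%
  ungroup() %>%
  filter(damage_dealt != 0)
```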

# changes W to 1, L to 0
majors2019$win <- ifelse(majors2019$win == "W", 1, 0) %>%
  as.factor()

# assigning a role to each player to allow for more precise comparisons
playerRoles <- majors2019 %>%
  group_by(player) %>%
  count(player, fave_weapon) %>%
  top_n(1, n) %>%
  mutate(role = fave_weapon) %>%
  subset(select = -c(fave_weapon, n))

# replace fav gun with corresponding role
playerRoles$role <- str_replace(playerRoles$role, "Saug 9mm", "1")
playerRoles$role <- str_replace(playerRoles$role, "Maddox RFB", "2")
playerRoles$role <- str_replace(playerRoles$role, "ICR-7", "3")

# making factors
playerRoles$role <- factor(playerRoles$role)

# manual adjustment for player TJHaly
playerRoles <- playerRoles[-c(83), ]

majors2019 <- dplyr::inner_join(playerRoles, majors2019, by = "player")

A player’s role is defined as a sub (1), a flex (2), or an AR (3).

Hardpoint subset

# all 2019 hardpoint data
hp2019 <- sqldf('SELECT player, k_d, role, win, kills, deaths, x, assists, damage_dealt, player_spm, hill_time_s, hill_captures, hill_defends, x2_piece, x3_piece, x4_piece FROM majors2019 WHERE mode == "Hardpoint"')
hp2019 <- hp2019[order(hp2019$player),]

Search and Destroy subset

# all 2019 SND data
snd2019 <- sqldf('SELECT match_id, team, player, role, win, kills, deaths, k_d, assists, damage_dealt, player_spm, bomb_sneak_defuses, bomb_plants, bomb_defuses, snd_rounds, snd_firstbloods, snd_1_kill_round, snd_2_kill_round, snd_3_kill_round, snd_4_kill_round, x2_piece, x3_piece, x4_piece FROM majors2019 WHERE mode == "Search & Destroy"')

# adds new column with fb/round ratio
snd2019 <- add_column(snd2019, fb_round_ratio = snd2019$snd_firstbloods/snd2019$snd_rounds)

# adding a new column with average first bloods for the season
snd2019 <- snd2019 %>%
  group_by(player) %>%
  mutate(fb_avg = mean(snd_firstbloods))

# puts data in alphabetical order
snd2019 <- snd2019[order(snd2019$player),]

Control subset

# all 2019 CONTROL data
control2019 <- sqldf('SELECT player, role, win, k_d, assists, damage_dealt, player_spm, x2_piece, x3_piece, x4_piece, ctrl_firstbloods, ctrl_firstdeaths, ctrl_captures FROM majors2019 WHERE mode == "Control"')
control2019 <- control2019[order(control2019$player),]

Champs 2019 dataset

champs2019 <- champs2019 %>% clean_names(.)
champs2019 <- champs2019[order(champs2019$player),]
champs2019 <- sqldf('SELECT * FROM champs2019 WHERE match_id NOT LIKE "missing%"')
champs2019 <- sqldf('SELECT * FROM champs2019 WHERE damage_dealt != "0"')

# changes W to 1, L to 0
champs2019$win <- ifelse(champs2019$win == "W", 1, 0) %>%
  as.factor()

champs2019 <- dplyr::inner_join(playerRoles, champs2019, by = "player")

Hardpoint CHAMPS subset

# CHAMPS 2019 hardpoint data
hpChamps <- sqldf('SELECT player, k_d, role, win, kills, deaths, x, assists, damage_dealt, player_spm, hill_time_s, hill_captures, hill_defends FROM champs2019 WHERE mode == "Hardpoint"')
hpChamps <- hpChamps[order(hpChamps$player),]

Search and Destroy CHAMPS subset

# CHAMPS 2019 SND data
sndChamps <- sqldf('SELECT player, win, role, k_d, assists, damage_dealt, player_spm, bomb_sneak_defuses, bomb_plants, bomb_defuses, snd_rounds, snd_firstbloods FROM champs2019 WHERE mode == "Search & Destroy"')

# adds new column with fb/round ratio
sndChamps <- add_column(sndChamps, fb_round_ratio = sndChamps$snd_firstbloods/sndChamps$snd_rounds)

# adding a new column with average first bloods for the season
sndChamps <- sndChamps %>%
  group_by(player) %>%
  mutate(fb_avg = mean(snd_firstbloods))

# puts data in alphabetical order
sndChamps <- sndChamps[order(sndChamps$player),]

Control CHAMPS subset

# CHAMPS 2019 CONTROL data
controlChamps <- sqldf('SELECT player, role, win, k_d, assists, damage_dealt, player_spm FROM champs2019 WHERE mode == "Control"')
controlChamps <- controlChamps[order(controlChamps$player),]

Team Hardpoint Data

# getting all necessary data for hardpoint
mergedhp2019 <- sqldf('SELECT match_id, team, player, role, kills, deaths, win, assists, damage_dealt, player_spm, hill_captures, hill_defends FROM majors2019 WHERE mode == "Hardpoint"')

# organizing by each match
mergedhp2019 <- mergedhp2019[order(mergedhp2019$match_id),]

# removing all matches that DON'T include all 10 players
# calculates all the matches that have all 10 players
match_numplayers <- count(mergedhp2019, match_id) %>% subset(., n == 10) %>% remove_cols(n)

# includes matches where all 10 players have existing data
mergedhp2019 <- sqldf('SELECT * FROM mergedhp2019 WHERE match_id IN match_numplayers')


# merge rows so that all the players from each team are one row; expect 800 observations with about 50 variables
test_mergedhp2019 <- mergedhp2019 %>%
  rename(damage = damage_dealt,
         spm = player_spm,
         hillcaptures = hill_captures,
         hilldefends = hill_defends) %>%
   mutate(rn = rowid(match_id, team)) %>% 
   pivot_wider(names_from = rn, values_from = c(win, 
                                                player, 
                                                kills,
                                                deaths,
                                                assists, 
                                                damage, 
                                                spm, 
                                                hillcaptures, 
                                                hilldefends)) %>%
  subset(select = -c(win_2, win_3, win_4, win_5,
                     player_1, player_2, player_3, player_4, player_5)) %>%
  rename(win = win_1)

# team_mergedhp2019 <- test_mergedhp2019 %>%
#   group_by(match_id, team) %>%
#   mutate(kills = sum(kills_1, kills_2, kills_3, kills_4, kills_5),
#          deaths = sum(deaths_1, deaths_2, deaths_3, deaths_4, deaths_5),
#          kd = kills/deaths,
#          assists = sum(assists_1, assists_2, assists_3, assists_4, assists_5),
#          spm = mean(spm_1, spm_2, spm_3, spm_4, spm_5),
#          hillcaptures = sum(hillcaptures_1, hillcaptures_2, hillcaptures_3, hillcaptures_4, hillcaptures_5),
#          hilldefends = sum(hilldefends_1, hilldefends_2, hilldefends_3, hilldefends_4, hilldefends_5),
#          damage = sum(damage_1, damage_2, damage_3, damage_4, damage_5)) %>%
#   subset(select = c(win, kd, assists, spm, hillcaptures, hilldefends, damage))

Team SND Data

# getting all necessary data for search and destroy
team_snd2019 <- sqldf('SELECT match_id, k_d, role, team, player, win, assists, damage_dealt, player_spm, bomb_sneak_defuses, bomb_plants, bomb_defuses, snd_firstbloods, snd_rounds FROM snd2019')

# organizing by each match
team_snd2019 <- team_snd2019[order(team_snd2019$match_id),]

# removing all matches that DON'T include all 10 players
# calculates all the matches that have all 10 players
match_numplayers <- count(team_snd2019, match_id) %>% subset(., n == 10) %>% remove_cols(n)

# includes matches where all 10 players have existing data
team_snd2019 <- sqldf('SELECT * FROM team_snd2019 WHERE match_id IN match_numplayers')


# merge rows so that all the players from each team are one row; expect 800 observations with about 50 variables
team_snd2019 <- team_snd2019 %>%
  rename(kd = k_d,
         damage = damage_dealt,
         spm = player_spm,
         fb = snd_firstbloods, 
         rounds = snd_rounds,
         defuses = bomb_defuses,
         plants = bomb_plants,
         nd = bomb_sneak_defuses) %>%
   mutate(rn = rowid(match_id, team)) %>% 
   pivot_wider(names_from = rn, values_from = c(win, 
                                                player,
                                                kd,
                                                role,
                                                assists, 
                                                damage, 
                                                spm, 
                                                fb, 
                                                rounds,
                                                defuses,
                                                plants,
                                                nd)) %>%
  subset(select = -c(win_2, win_3, win_4, win_5,
                     player_1, player_2, player_3, player_4, player_5,
                     rounds_2, rounds_3, rounds_4, rounds_5,
                     match_id, team)) %>%
  rename(win = win_1) %>%
  rename(rounds = rounds_1)



# team_snd2019 <- team_snd2019 %>%
#   group_by(match_id, team) %>%
#   mutate(kills = sum(kills_1, kills_2, kills_3, kills_4, kills_5),
#          deaths = sum(deaths_1, deaths_2, deaths_3, deaths_4, deaths_5),
#          kd = kills/deaths,
#          assists = sum(assists_1, assists_2, assists_3, assists_4, assists_5),
#          spm = mean(spm_1, spm_2, spm_3, spm_4, spm_5),
#          damage = sum(damage_1, damage_2, damage_3, damage_4, damage_5),
#          fb = sum(fb_1, fb_2, fb_3, fb_4, fb_5),
#          fbratio = fb/rounds_1,
#          plants = sum(plants_1, plants_2, plants_3, plants_4, plants_5),
#          defuses = sum(defuses_1, defuses_2, defuses_3, defuses_4, defuses_5),
#          nd = sum(nd_1, nd_2, nd_3, nd_4, nd_5)) %>%
#   subset(select = c(win, role, kd, kills, deaths, assists, rounds_1, damage, fbratio, plants, defuses, nd))

Exploratory Data Analysis

For my exploratory data analysis, I will be using just the season data. It will not include the Champs data.

Kill/death for season

ggplot(majors2019, aes(x = reorder(player, k_d), y = k_d)) + geom_boxplot() + coord_flip(ylim = c(0, 3.5)) + labs(y = "Kill/death ratio", x = "Player", subtitle = "OVERALL Player K/D's, 2019 Season (BO4), Descending")

ggplot(hp2019, aes(x = reorder(player, k_d), y = k_d)) + geom_boxplot() + coord_flip(ylim = c(0, 3.5)) + labs(y = "Kill/death ratio", x = "Player", subtitle = "Player K/D's for HARDPOINT, 2019 Season (BO4), Descending")

ggplot(snd2019, aes(x = reorder(player, k_d), y = k_d)) + geom_boxplot() + coord_flip(ylim = c(0, 5)) + labs(y = "Kill/death ratio", x = "Player", subtitle = "Player K/D's for SEARCH AND DESTROY, 2019 Season (BO4), Descending")

ggplot(control2019, aes(x = reorder(player, k_d), y = k_d)) + geom_boxplot() + coord_flip(ylim = c(0, 3.5)) + labs(y = "Kill/death ratio", x = "Player", subtitle = "Player K/D's for CONTROL, 2019 Season (BO4), Descending")

Search and Destroy First Bloods

Search and Destroy is a game mode with multiple rounds, and in each round every player has only one life. A “first blood” is the first kill of the round and is usually highly influential. This is a common stat that commentators and the community look at.

Firstblood average

# player firstblood average for SND 2019

ggplot(snd2019, aes(x = reorder(player, fb_avg), y = fb_avg)) + geom_point() + coord_flip(ylim = c(0, 3)) + labs(y = "Firstblood Average", x = "Player", subtitle = "Player Firstblood Average for SEARCH AND DESTROY, 2019 Season (BO4), Descending")

Firstblood totals

# player firstbloods for SND 2019

ggplot(snd2019, aes(x = reorder(player, snd_firstbloods), y = snd_firstbloods)) + geom_boxplot() + coord_flip(ylim = c(0, 6)) + labs(y = "Firstbloods", x = "Player", subtitle = "Player Firstbloods for SEARCH AND DESTROY, 2019 Season (BO4), Descending")

Firstblood/round

# player firstblood/round for SND 2019

ggplot(snd2019, aes(x = reorder(player, fb_round_ratio), y = fb_round_ratio)) + geom_boxplot() + coord_flip(ylim = c(0, 0.6)) + labs(y = "Firstblood/round ratio", x = "Player", subtitle = "Player Firstblood/Round for SEARCH AND DESTROY, 2019 Season (BO4), Descending")

Overall Damage Dealt

# player damage dealt OVERALL 2019

# removes all entries where damage is 0; this is almost always a result of data loss
majors2019 <- sqldf('SELECT * FROM majors2019 WHERE damage_dealt != "0"')
playerDamage <- sqldf('SELECT player, damage_dealt FROM majors2019 WHERE damage_dealt != "0"')

ggplot(playerDamage, aes(x = reorder(player, damage_dealt), y = damage_dealt)) + geom_boxplot() + coord_flip(ylim = c(0, 10000)) + labs(y = "Damage Dealt", x = "Player", subtitle = "OVERALL Player Damage Dealt, 2019 Season (BO4), Descending")

Overall Score/Min (spm)

# Overall score per minute for 2019 season

ggplot(majors2019, aes(x = reorder(player, player_spm), y = player_spm)) + geom_boxplot() + coord_flip(ylim = c(0, 675)) + labs(y = "Score per minute", x = "Player", subtitle = "OVERALL Player Score per minute, 2019 Season (BO4), Descending")

Number of Wins

# Overall number of wins for 2019 season

playerwins <- sqldf('SELECT player, win FROM majors2019 WHERE win == "1"') # selects all the wins for each player
playerwins <- playerwins %>% count(player) # counts the number of wins per player

ggplot(playerwins, aes(x = reorder(player, n), y = n)) + geom_bar(stat = 'identity') + coord_flip() + labs(y = "Number of Wins", x = "Player", subtitle = "OVERALL Number of Wins per Player, 2019 Season (BO4), Descending")

The top 4 players with the most wins in the season are Slasher, Octane, Kenny, and Enable. The interesting part is that all of these players were on the same team, 100 Thieves. They all tied with 116 wins during the season.

playerwins %>%
  ggplot(aes(x = n)) + geom_histogram(binwidth = 15, color = "black", fill = "white")

The number of wins appears to follow a normal distribution. The left side of the histogram appears to be slightly more populated, but I hypothesize that this is due to players that didn’t play for the whole season.


Models: Hardpoint

I will be trying to predict whether an individual player will win or lose a game based on his statistics in the given game.

Splitting Data:

Here, I am splitting the hardpoint data with 80% training and 20% testing. The data is stratified on the “win” variable.

hp2019_wl <- hp2019

set.seed(3068)

hp2019_wlsplit <- hp2019_wl %>%
  initial_split(prop = 0.8, strata = "win")

hp2019_train <- training(hp2019_wlsplit)
hp2019_test <- testing(hp2019_wlsplit)

Below are the head of the training data, the dimensions of the training and testing data, and the distribution of wins.

head(hp2019_train)
##    player  k_d role win kills deaths   x assists damage_dealt player_spm
## 4   Abezy 0.66    1   0    19     29 -10       6         3891      290.9
## 5   Abezy 1.18    1   0    26     22   4       8         4480      393.3
## 12  Abezy 0.88    1   0    22     25  -3      14         4515      322.3
## 15  Abezy 0.76    1   0    19     25  -6       8         4868      295.7
## 16  Abezy 0.83    1   0    20     24  -4       4         3954      269.5
## 20  Abezy 1.33    1   0    28     21   7       7         4733      400.3
##    hill_time_s hill_captures hill_defends x2_piece x3_piece x4_piece
## 4           48             4            5        2        0        0
## 5           55             5           14        2        1        0
## 12          80             9            6        4        0        0
## 15          77             7           11        5        0        0
## 16          35             3            6        2        2        0
## 20          71             6           10        3        0        0
dim(hp2019_train)
## [1] 3551   16
dim(hp2019_test)
## [1] 889  16
prop.table(table(hp2019_train$win))
## 
##         0         1 
## 0.4987328 0.5012672

Creating a recipe and folding:

To begin my Hardpoint models, I made a recipe that contained all of my predictor variables. I then normalized all of my predictor variables.

hp_recipe <- recipe(win ~ k_d + assists + damage_dealt + 
                        player_spm + hill_time_s + hill_captures + 
                        hill_defends + x2_piece + x3_piece + x4_piece, 
                      data = hp2019_train) %>%
  step_normalize(all_predictors())

After making my recipe, I decided to fold my data with 10 folds and 5 repeats.

hp_train_folds <- vfold_cv(hp2019_train, v = 10, repeats = 5)

Model 1: Decision Tree

Creating a general decision tree specification using rpart:

hp_tree_spec <- decision_tree() %>%
  set_engine("rpart")

Setting a classification decision tree engine:

hp_class_tree_spec <- hp_tree_spec %>%
  set_mode("classification")

Fitting the model:

hp_class_tree_fit <- hp_class_tree_spec %>%
  fit(win ~ k_d + assists + damage_dealt + 
                        player_spm + hill_time_s + hill_captures + 
                        hill_defends + x2_piece + x3_piece + x4_piece, 
                      data = hp2019_train)

Visualizing the decision tree:

hp_class_tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot()
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
##     Call rpart.plot with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.

Checking confusion matrix and accuracy of the train data:

augment(hp_class_tree_fit, new_data = hp2019_train) %>%
  conf_mat(truth = win, estimate = .pred_class)
##           Truth
## Prediction    0    1
##          0 1379  601
##          1  392 1179
hp_dt_accuracy <- augment(hp_class_tree_fit, new_data = hp2019_train) %>%
  accuracy(truth = win, estimate = .pred_class)
hp_dt_accuracy
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.720

Creating a workflow that is ready to tune cost complexity:

hp_class_tree_wf <- workflow() %>%
  add_model(hp_class_tree_spec %>% set_args(cost_complexity = tune())) %>%
  add_recipe(hp_recipe)

Setting up a regular grid:

param_grid <- grid_regular(cost_complexity(range = c(-3, -1)), levels = 10)

Fitting and tuning our model:

hp_dt_tune <- hp_class_tree_wf %>%
  tune_grid(
    resamples = hp_train_folds,
    grid = param_grid,
    metrics = metric_set(accuracy))

Plotting our model, which shows what cost-complexity produces the highest accuracy:

autoplot(hp_dt_tune)

Selecting the best performing value and finalizing the workflow:

hp_best_complexity <- select_best(hp_dt_tune)

hp_class_tree_final <- finalize_workflow(hp_class_tree_wf, hp_best_complexity)

hp_class_tree_final_fit <- fit(hp_class_tree_final, data = hp2019_train)

Visualizing the final model:

hp_class_tree_final_fit %>%
  extract_fit_engine() %>%
  rpart.plot(roundint = FALSE)

Checking the accuracy of the final model:

hp_tuned_dt_accuracy <- augment(hp_class_tree_final_fit, new_data = hp2019_train) %>%
  accuracy(truth = win, estimate = .pred_class)

As we can see, the final accuracy for the tuned model was slightly higher with an estimate of 0.7321881, compared to that of the untuned model with an estimate of 0.7203605.


Model 2: Random Forest

Now it was time to prepare my model. I tuned min_n and mtry, set my mode to “classification”, and set my engine to “ranger.” My workflow was set up to use both my Hardpoint recipe and my Hardpoint random forest model.

hp_rf_model <- rand_forest(min_n = tune(),
                        mtry = tune(),
                        mode = "classification") %>%
  set_engine("ranger")

hp_rf_workflow <- workflow() %>%
  add_model(hp_rf_model) %>%
  add_recipe(hp_recipe)

Next, I set up parameters for the grid that I was going to make. The parameters were taken from the Hardpoint random forest model, and the mtry range was set from 2 to 10; the upper limit matches the number of predictors in the recipe.

hp_rf_parameters <- hardhat::extract_parameter_set_dials(hp_rf_model) %>%
  update(mtry = mtry(range = c(2, 10)))

hp_rf_grid <- grid_regular(hp_rf_parameters, levels = 2)

Then, I ran my model by tuning and fitting, using my folded data and my grid.

hp_rf_tune <- hp_rf_workflow %>%
  tune_grid(resamples = hp_train_folds,
            grid = hp_rf_grid)

The last thing to do was to plot my tuned model.

autoplot(hp_rf_tune)

As we can see from the above plot, accuracy tends to decrease as mtry increases, i.e., as more predictors are considered at each split. I hypothesize that this is because k_d is by far the most significant predictor; the other predictors are much less significant and actually worsen the model through overfitting. However, the decrease in accuracy is very small in general.
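One way to probe this hypothesis would be to refit the chosen random forest with variable importance enabled. This is a sketch, not part of the original analysis: it assumes the `hp_recipe` and `hp2019_train` objects defined above, uses ranger’s `importance = "impurity"` engine option, and plugs in the mtry/min_n values selected by tuning.

```r
# sketch: refit with impurity-based variable importance (ranger engine option)
hp_rf_vip_model <- rand_forest(mtry = 2, min_n = 40, mode = "classification") %>%
  set_engine("ranger", importance = "impurity")

hp_rf_vip_fit <- workflow() %>%
  add_model(hp_rf_vip_model) %>%
  add_recipe(hp_recipe) %>%
  fit(data = hp2019_train)

# higher values indicate more important predictors;
# if the hypothesis holds, k_d should sit at the top
hp_rf_vip_fit %>%
  extract_fit_engine() %>%
  ranger::importance() %>%
  sort(decreasing = TRUE)
```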

Checking the accuracy of the final model:

hp_rf_tuned_accuracy <- show_best(hp_rf_tune, metric = "accuracy")
hp_rf_tuned_accuracy
## # A tibble: 4 × 8
##    mtry min_n .metric  .estimator  mean     n std_err .config             
##   <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1     2     2 accuracy binary     0.744    50 0.00304 Preprocessor1_Model1
## 2     2    40 accuracy binary     0.744    50 0.00280 Preprocessor1_Model3
## 3    10    40 accuracy binary     0.739    50 0.00300 Preprocessor1_Model4
## 4    10     2 accuracy binary     0.734    50 0.00298 Preprocessor1_Model2
hp_rf_tuned_accuracy[1,5]
## # A tibble: 1 × 1
##    mean
##   <dbl>
## 1 0.744

We had the highest accuracy of 0.7427786 with a minimum node size of 40 and an mtry of 2.


Model 3: Logistic Regression

First, I needed to set up my model. I set my engine to “glm” for logistic regression and set the mode to “classification.”

hp_log_reg <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

Setting up workflow with the model I created last step, as well as the recipe that I created earlier.

hp_log_wkflow <- workflow() %>% 
  add_model(hp_log_reg) %>% 
  add_recipe(hp_recipe)

Fit the model to the folded data:

hp_log_fit <- fit_resamples(hp_log_wkflow, hp_train_folds)

Collecting metrics based on the folded data:

collect_metrics(hp_log_fit)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.748    50 0.00339 Preprocessor1_Model1
## 2 roc_auc  binary     0.834    50 0.00296 Preprocessor1_Model1

Fitting the model to the whole dataset, not just the folds:

hp_log_fit_train <- fit(hp_log_wkflow, hp2019_train)

Assessing model performance with the training data:

predict(hp_log_fit_train, new_data = hp2019_train, type = "prob")
## # A tibble: 3,551 × 2
##    .pred_0 .pred_1
##      <dbl>   <dbl>
##  1   0.845  0.155 
##  2   0.745  0.255 
##  3   0.242  0.758 
##  4   0.860  0.140 
##  5   0.914  0.0865
##  6   0.342  0.658 
##  7   0.816  0.184 
##  8   0.668  0.332 
##  9   0.707  0.293 
## 10   0.638  0.362 
## # … with 3,541 more rows
augment(hp_log_fit_train, new_data = hp2019_train) %>%
  conf_mat(truth = win, estimate = .pred_class)
##           Truth
## Prediction    0    1
##          0 1380  499
##          1  391 1281
augment(hp_log_fit_train, new_data = hp2019_train) %>%
  conf_mat(truth = win, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

Checking accuracy with the training data:

hp_log_reg_accuracy <- augment(hp_log_fit_train, new_data = hp2019_train) %>%
  accuracy(truth = win, estimate = .pred_class)
hp_log_reg_accuracy
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.749

Model 4: K Nearest Neighbors

Setting up the model. I will be tuning “neighbors.”

hp_knn_model <- 
  nearest_neighbor(
    neighbors = tune(),
    mode = "classification") %>% 
  set_engine("kknn")

Next, I set up the workflow.

hp_knn_workflow <- workflow() %>% 
  add_model(hp_knn_model) %>% 
  add_recipe(hp_recipe)

I then set up the tuning grid.

hp_knn_parameters <- hardhat::extract_parameter_set_dials(hp_knn_model)
hp_knn_grid <- grid_regular(hp_knn_parameters, levels = 2)

Fitting and tuning my model:

hp_knn_tune <- hp_knn_workflow %>%
  tune_grid(resamples = hp_train_folds,
            grid = hp_knn_grid)

Plotting the model:

autoplot(hp_knn_tune, metric = "accuracy")

This plot shows us that as the number of neighbors increases, there is also an increase in accuracy.

Testing the accuracy of the model:

hp_knn_tuned_accuracy <- show_best(hp_knn_tune, metric = "accuracy")
hp_knn_tuned_accuracy
## # A tibble: 2 × 7
##   neighbors .metric  .estimator  mean     n std_err .config             
##       <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1        15 accuracy binary     0.709    50 0.00334 Preprocessor1_Model2
## 2         1 accuracy binary     0.649    50 0.00329 Preprocessor1_Model1

As we can see, the model performs best with 15 neighbors, yielding an accuracy of 0.7091532.


Comparing Model Performance

hp_accuracies <- c(hp_tuned_dt_accuracy$.estimate,
                    hp_rf_tuned_accuracy[1,5],
                    hp_log_reg_accuracy$.estimate, 
                    hp_knn_tuned_accuracy[1, 4])
hp_accuracies
## [[1]]
## [1] 0.7555618
## 
## $mean
## [1] 0.7438462
## 
## [[3]]
## [1] 0.7493664
## 
## $mean
## [1] 0.7091532

As we can see from this, the tuned decision tree model appears to have the highest accuracy. (Note that the decision tree and logistic regression figures are training-set accuracies, while the random forest and KNN figures are cross-validated means, so this comparison is only rough.)
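Because the vector above mixes `$.estimate` numerics with one-row tibble slices, `c()` returns an awkward list. A cleaner sketch collects the four accuracy values (copied from the outputs above) into a named numeric vector:

```r
# Accuracy estimates copied from the printed results above, stored as
# plain numerics so comparison is straightforward.
hp_accuracies_clean <- c(
  decision_tree = 0.7555618,
  random_forest = 0.7438462,
  logistic_reg  = 0.7493664,
  knn           = 0.7091532
)
names(which.max(hp_accuracies_clean))  # the decision tree leads
```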


Models: Search and Destroy

I will be trying to predict whether an individual player will win or lose a game based on his statistics in the given game.

Splitting Data:

Here, I am splitting the Search and Destroy data with 80% training and 20% testing. The data is stratified on the “win” variable.

set.seed(1)

snd2019_split <- snd2019 %>%
  initial_split(prop = 0.8, strata = "win")

snd2019_train <- training(snd2019_split)
snd2019_test <- testing(snd2019_split)

Below is the head of the training data, along with the dimensions of the training and testing data and the distribution of wins.

head(snd2019_train)
## # A tibble: 6 × 25
## # Groups:   player [1]
##   match_id      team  player role  win   kills deaths   k_d assists damage_dealt
##   <chr>         <chr> <chr>  <fct> <fct> <dbl>  <dbl> <dbl>   <dbl>        <dbl>
## 1 737440468739… eUni… Abezy  1     0        11      7  1.57       0         1355
## 2 144010197940… eUni… Abezy  1     0         7      7  1          1         1938
## 3 259560665349… eUni… Abezy  1     0         2      7  0.29       4          796
## 4 146217688303… eUni… Abezy  1     0         7      7  1          0         1275
## 5 177954718572… eUni… Abezy  1     0         9      8  1.12       1         1561
## 6 167778343948… eUni… Abezy  1     0        10      9  1.11       3         1434
## # … with 15 more variables: player_spm <dbl>, bomb_sneak_defuses <dbl>,
## #   bomb_plants <dbl>, bomb_defuses <dbl>, snd_rounds <dbl>,
## #   snd_firstbloods <dbl>, snd_1_kill_round <dbl>, snd_2_kill_round <dbl>,
## #   snd_3_kill_round <dbl>, snd_4_kill_round <dbl>, x2_piece <dbl>,
## #   x3_piece <dbl>, x4_piece <dbl>, fb_round_ratio <dbl>, fb_avg <dbl>
dim(snd2019_train)
## [1] 2791   25
dim(snd2019_test)
## [1] 699  25
prop.table(table(snd2019_train$win))
## 
##        0        1 
## 0.498746 0.501254

Creating a recipe and folding:

To begin my Search and Destroy models, I made a recipe that contained all of my predictor variables. I then normalized all of the predictors.

snd_recipe <- recipe(win ~ k_d + assists + damage_dealt + 
                        player_spm + bomb_sneak_defuses + 
                        bomb_plants + bomb_defuses + snd_firstbloods + 
                       fb_round_ratio + snd_1_kill_round +
                       snd_2_kill_round + snd_3_kill_round +
                       snd_4_kill_round + x2_piece + x3_piece + x4_piece, 
                      data = snd2019_train) %>%
  step_normalize(all_predictors())

After making my recipe, I decided to fold my data with 10 folds and 5 repeats.

snd_train_folds <- vfold_cv(snd2019_train, v = 10, repeats = 5)

Model 1: Decision Tree

Creating a general decision tree specification using rpart:

snd_tree_spec <- decision_tree() %>%
  set_engine("rpart")

Setting a classification decision tree engine:

snd_class_tree_spec <- snd_tree_spec %>%
  set_mode("classification")

Fitting the model:

snd_class_tree_fit <- snd_class_tree_spec %>%
  fit(win ~ k_d + assists + damage_dealt + 
                        player_spm + bomb_sneak_defuses + 
                        bomb_plants + bomb_defuses + snd_firstbloods + 
                       fb_round_ratio + snd_1_kill_round +
                       snd_2_kill_round + snd_3_kill_round +
                       snd_4_kill_round + x2_piece + x3_piece + x4_piece, 
                      data = snd2019_train)

Visualizing the decision tree:

snd_class_tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot()
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
##     Call rpart.plot with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.

Checking the confusion matrix and accuracy on the training data:

augment(snd_class_tree_fit, new_data = snd2019_train) %>%
  conf_mat(truth = win, estimate = .pred_class)
##           Truth
## Prediction    0    1
##          0 1099  544
##          1  293  855
snd_dt_accuracy <- augment(snd_class_tree_fit, new_data = snd2019_train) %>%
  accuracy(truth = win, estimate = .pred_class)
snd_dt_accuracy
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.700

Creating a workflow that is ready to tune cost complexity:

snd_class_tree_wf <- workflow() %>%
  add_model(snd_class_tree_spec %>% set_args(cost_complexity = tune())) %>%
  add_recipe(snd_recipe)

Setting up a regular grid:

parameter_grid <- grid_regular(cost_complexity(range = c(-3, -1)), levels = 10)
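For reference, `cost_complexity()` is tuned on a log10 scale in dials, so `range = c(-3, -1)` with 10 levels spaces the candidate values logarithmically between 0.001 and 0.1. The same values reconstructed in base R:

```r
# The 10 cost-complexity candidates produced by the regular grid here:
# log-spaced between 10^-3 and 10^-1.
cp_values <- 10^seq(-3, -1, length.out = 10)
round(cp_values, 4)
```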

Fitting and tuning our model:

snd_rf_tune <- snd_class_tree_wf %>%
  tune_grid(resamples = snd_train_folds, grid = parameter_grid)

Plotting our model, which shows which cost-complexity value produces the highest accuracy:

autoplot(snd_rf_tune)

Selecting the best performing value and finalizing the workflow:

snd_best_complexity <- select_best(snd_rf_tune, metric = "accuracy")

snd_class_tree_final <- finalize_workflow(snd_class_tree_wf, snd_best_complexity)

snd_class_tree_final_fit <- fit(snd_class_tree_final, data = snd2019_train)

Visualizing the final model:

snd_class_tree_final_fit %>%
  extract_fit_engine() %>%
  rpart.plot(roundint = FALSE)

Checking the accuracy of the final model:

snd_tuned_dt_accuracy <- augment(snd_class_tree_final_fit, new_data = snd2019_train) %>% accuracy(truth = win, estimate = .pred_class)
snd_tuned_dt_accuracy
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.742

As we can see, the final accuracy for the tuned model was slightly higher with an estimate of 0.7416697, compared to that of the untuned model with an estimate of 0.7001075.


Model 2: Random Forest

Now it was time to prepare my model. I tuned min_n and mtry, set my mode to “classification”, and set my engine to “ranger.” My workflow was set up to use both my Search and Destroy recipe and my Search and Destroy random forest model.

snd_rf_model <- rand_forest(min_n = tune(),
                        mtry = tune(),
                        mode = "classification") %>%
  set_engine("ranger")

snd_rf_workflow <- workflow() %>%
  add_model(snd_rf_model) %>%
  add_recipe(snd_recipe)

Next, I set up parameters for the grid that I was going to make. The parameters were extracted from the Search and Destroy random forest model, and the mtry range was set from 1 to 10, below the recipe's 16 predictors.

snd_rf_parameters <- hardhat::extract_parameter_set_dials(snd_rf_model) %>%
  update(mtry = mtry(range = c(1, 10)))

snd_rf_grid <- grid_regular(snd_rf_parameters, levels = 2)
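With `levels = 2`, the regular grid simply crosses the two endpoints of each parameter's range, giving four candidate models. A conceptual base-R sketch (min_n's endpoints of 2 and 40 are dials' defaults, inferred from the tuning output below):

```r
# What a levels-2 regular grid over these two parameters looks like:
# each parameter contributes its two range endpoints, fully crossed.
sketch_grid <- expand.grid(mtry = c(1, 10), min_n = c(2, 40))
nrow(sketch_grid)  # 4 candidate models
```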

Then, I ran my model by tuning and fitting, using my folded data and my grid.

snd_rf_tune <- snd_rf_workflow %>%
  tune_grid(resamples = snd_train_folds,
            grid = snd_rf_grid)

The last thing to do was to plot my tuned model.

autoplot(snd_rf_tune)

As we can see from the above plot, when we increase the number of predictors, our accuracy and ROC AUC increase slightly. I hypothesize that this differs from the Hardpoint plot because more predictors are significant in Search and Destroy.

Checking the accuracy of the final model:

snd_rf_tuned_accuracy <- show_best(snd_rf_tune, metric = "accuracy")
snd_rf_tuned_accuracy
## # A tibble: 4 × 8
##    mtry min_n .metric  .estimator  mean     n std_err .config             
##   <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1    10    40 accuracy binary     0.709    50 0.00373 Preprocessor1_Model4
## 2    10     2 accuracy binary     0.703    50 0.00376 Preprocessor1_Model2
## 3     1     2 accuracy binary     0.668    50 0.00463 Preprocessor1_Model1
## 4     1    40 accuracy binary     0.666    50 0.00500 Preprocessor1_Model3
snd_rf_tuned_accuracy[1,5]
## # A tibble: 1 × 1
##    mean
##   <dbl>
## 1 0.709

We had the highest accuracy of 0.7091295 with a minimum node size of 40 and an mtry of 10.


Model 3: Logistic Regression

First, I needed to set up my model. I set my engine to “glm” for logistic regression and set the mode to “classification.”

snd_log_reg <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

Setting up the workflow with the model created in the last step, as well as the recipe that I created earlier.

snd_log_wkflow <- workflow() %>% 
  add_model(snd_log_reg) %>% 
  add_recipe(snd_recipe)

Fit the model to the folded data:

snd_log_fit <- fit_resamples(snd_log_wkflow, snd_train_folds)
## ! Fold01, Repeat1: preprocessor 1/1, model 1/1: glm.fit: fitted probabilities numerically 0...
## (this warning repeats for all 50 resamples: 10 folds × 5 repeats)
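These warnings usually mean the classes are nearly perfectly separable on some predictor within a fold, which drives a few fitted probabilities to exactly 0 or 1; here, round-level stats such as first bloods may separate wins from losses almost perfectly in some resamples. A minimal, self-contained reproduction on toy data (not the project's):

```r
# Toy reproduction of the glm.fit warning: y is perfectly separated by x,
# so maximum-likelihood coefficients diverge and the fitted probabilities
# approach exactly 0 and 1.
x <- c(-2, -1, 1, 2)
y <- c(0, 0, 1, 1)
fit <- glm(y ~ x, family = binomial)  # warns about probabilities near 0/1
range(fitted(fit))  # essentially 0 and 1
```

The warnings do not stop fitting, but they are a hint that some coefficient estimates are unstable.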

Collecting metrics based on the folded data:

collect_metrics(snd_log_fit)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.732    50 0.00328 Preprocessor1_Model1
## 2 roc_auc  binary     0.807    50 0.00379 Preprocessor1_Model1

Fitting the model to the full training set, not just the folds:

snd_log_fit_train <- fit(snd_log_wkflow, snd2019_train)

Assessing model performance with the training data:

predict(snd_log_fit_train, new_data = snd2019_train, type = "prob")
## # A tibble: 2,791 × 2
##    .pred_0 .pred_1
##      <dbl>   <dbl>
##  1   0.571   0.429
##  2   0.739   0.261
##  3   0.712   0.288
##  4   0.654   0.346
##  5   0.771   0.229
##  6   0.825   0.175
##  7   0.486   0.514
##  8   0.748   0.252
##  9   0.433   0.567
## 10   0.458   0.542
## # … with 2,781 more rows
augment(snd_log_fit_train, new_data = snd2019_train) %>%
  conf_mat(truth = win, estimate = .pred_class)
##           Truth
## Prediction    0    1
##          0 1134  474
##          1  258  925
augment(snd_log_fit_train, new_data = snd2019_train) %>%
  conf_mat(truth = win, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

Checking accuracy with the training data:

snd_log_reg_accuracy <- augment(snd_log_fit_train, new_data = snd2019_train) %>%
  accuracy(truth = win, estimate = .pred_class)
snd_log_reg_accuracy
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.738

Model 4: K Nearest Neighbors

Setting up the model. I will be tuning “neighbors.”

snd_knn_model <- 
  nearest_neighbor(
    neighbors = tune(),
    mode = "classification") %>% 
  set_engine("kknn")

Next, I set up the workflow.

snd_knn_workflow <- workflow() %>% 
  add_model(snd_knn_model) %>% 
  add_recipe(snd_recipe)

I then set up the tuning grid.

snd_knn_parameters <- hardhat::extract_parameter_set_dials(snd_knn_model)
snd_knn_grid <- grid_regular(snd_knn_parameters, levels = 2)

Fitting and tuning my model:

snd_knn_tune <- snd_knn_workflow %>%
  tune_grid(resamples = snd_train_folds,
            grid = snd_knn_grid)

Plotting the model:

autoplot(snd_knn_tune, metric = "accuracy")

This plot shows us that as the number of neighbors increases, there is also an increase in accuracy.

Testing the accuracy of the model:

snd_knn_tuned_accuracy <- show_best(snd_knn_tune, metric = "accuracy")
snd_knn_tuned_accuracy
## # A tibble: 2 × 7
##   neighbors .metric  .estimator  mean     n std_err .config             
##       <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1        15 accuracy binary     0.619    50 0.00403 Preprocessor1_Model2
## 2         1 accuracy binary     0.595    50 0.00385 Preprocessor1_Model1

As we can see, the model performs at its best with 15 neighbors, resulting in an accuracy of 0.6188415.


Comparing Model Performance

snd_accuracies <- c(snd_tuned_dt_accuracy$.estimate,
                    snd_rf_tuned_accuracy[1,5],
                    snd_log_reg_accuracy$.estimate, 
                    snd_knn_tuned_accuracy[1, 4])
snd_accuracies
## [[1]]
## [1] 0.7416697
## 
## $mean
## [1] 0.7091295
## 
## [[3]]
## [1] 0.7377284
## 
## $mean
## [1] 0.6188415

As we can see from this, the decision tree appears to have the highest accuracy.


Models: Control

I will be trying to predict whether an individual player will win or lose a game based on his statistics in the given game.

Splitting Data:

Here, I am splitting the Control data with 80% training and 20% testing. The data is stratified on the “win” variable.

set.seed(1)

control2019_split <- control2019 %>%
  initial_split(prop = 0.8, strata = "win")

control2019_train <- training(control2019_split)
control2019_test <- testing(control2019_split)

Below is the head of the training data, along with the dimensions of the training and testing data and the distribution of wins.

head(control2019_train)
##    player role win  k_d assists damage_dealt player_spm x2_piece x3_piece
## 4   Abezy    1   0 1.30      17         6459      386.2        5        1
## 16  Abezy    1   0 0.88       4         5353      215.9        2        0
## 18  Abezy    1   0 1.00       3         2657      285.7        3        0
## 21  Abezy    1   0 0.70       4         3077      225.0        1        0
## 22  Abezy    1   0 0.75       5         3357      217.5        2        1
## 23  Abezy    1   0 0.78       5         4315      215.3        1        1
##    x4_piece ctrl_firstbloods ctrl_firstdeaths ctrl_captures
## 4         0                1                0             4
## 16        0                1                0             2
## 18        0                0                0             1
## 21        0                0                0             1
## 22        0                1                1             1
## 23        0                1                0             2
dim(control2019_train)
## [1] 2122   13
dim(control2019_test)
## [1] 532  13
prop.table(table(control2019_train$win))
## 
##         0         1 
## 0.4995287 0.5004713

Creating a recipe and folding:

To begin my Control models, I made a recipe that contained all of my predictor variables. I then normalized all of the predictors.

control_recipe <- recipe(win ~ k_d + assists + damage_dealt + 
                        player_spm + ctrl_firstbloods +
                          ctrl_firstdeaths + ctrl_captures + 
                          x2_piece + x3_piece + x4_piece, 
                      data = control2019_train) %>%
  step_normalize(all_predictors())

After making my recipe, I decided to fold my data with 10 folds and 5 repeats.

control_train_folds <- vfold_cv(control2019_train, v = 10, repeats = 5)

Model 1: Decision Tree

Creating a general decision tree specification using rpart:

control_tree_spec <- decision_tree() %>%
  set_engine("rpart")

Setting a classification decision tree engine:

control_class_tree_spec <- control_tree_spec %>%
  set_mode("classification")

Fitting the model:

control_class_tree_fit <- control_class_tree_spec %>%
  fit(win ~ k_d + assists + damage_dealt + 
                        player_spm + ctrl_firstbloods +
                          ctrl_firstdeaths + ctrl_captures + 
                          x2_piece + x3_piece + x4_piece, 
                      data = control2019_train)

Visualizing the decision tree:

control_class_tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot()
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
##     Call rpart.plot with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.

Checking the confusion matrix and accuracy on the training data:

augment(control_class_tree_fit, new_data = control2019_train) %>%
  conf_mat(truth = win, estimate = .pred_class)
##           Truth
## Prediction   0   1
##          0 742 250
##          1 318 812
control_dt_accuracy <- augment(control_class_tree_fit, new_data = control2019_train) %>%
  accuracy(truth = win, estimate = .pred_class)
control_dt_accuracy
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.732
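Beyond accuracy, the confusion matrix above also yields per-class rates. Recomputing by hand (counts copied from the output; note that yardstick treats the first factor level, “0”, as the event):

```r
# Per-class rates from the confusion matrix above.
cm <- matrix(c(742, 318, 250, 812), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Truth = c("0", "1")))
accuracy    <- sum(diag(cm)) / sum(cm)        # (742 + 812) / 2122
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # losses caught: 742 / 1060
specificity <- cm["1", "1"] / sum(cm[, "1"])  # wins caught:   812 / 1062
round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 3)
```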

Creating a workflow that is ready to tune cost complexity:

control_class_tree_wf <- workflow() %>%
  add_model(control_class_tree_spec %>% set_args(cost_complexity = tune())) %>%
  add_recipe(control_recipe)

Setting up a regular grid:

parameter_grid <- grid_regular(cost_complexity(range = c(-3, -1)), levels = 10)

Fitting and tuning our model:

control_rf_tune <- control_class_tree_wf %>%
  tune_grid(resamples = control_train_folds, grid = parameter_grid)

Plotting our model, which shows which cost-complexity value produces the highest accuracy:

autoplot(control_rf_tune)

Selecting the best performing value and finalizing the workflow:

control_best_complexity <- select_best(control_rf_tune, metric = "accuracy")

control_class_tree_final <- finalize_workflow(control_class_tree_wf, control_best_complexity)

control_class_tree_final_fit <- fit(control_class_tree_final, data = control2019_train)

Visualizing the final model:

control_class_tree_final_fit %>%
  extract_fit_engine() %>%
  rpart.plot(roundint = FALSE)

Checking the accuracy of the final model:

control_tuned_dt_accuracy <- augment(control_class_tree_final_fit, new_data = control2019_train) %>% accuracy(truth = win, estimate = .pred_class)
control_tuned_dt_accuracy
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.827

As we can see, the final accuracy for the tuned model was considerably higher, with an estimate of 0.8270500, compared to that of the untuned model with an estimate of 0.7323280.


Model 2: Random Forest

Now it was time to prepare my model. I tuned min_n and mtry, set my mode to “classification”, and set my engine to “ranger.” My workflow was set up to use both my Control recipe and my Control random forest model.

control_rf_model <- rand_forest(min_n = tune(),
                        mtry = tune(),
                        mode = "classification") %>%
  set_engine("ranger")

control_rf_workflow <- workflow() %>%
  add_model(control_rf_model) %>%
  add_recipe(control_recipe)

Next, I set up parameters for the grid that I was going to make. The parameters were extracted from the Control random forest model, and the mtry range was set from 1 to 9. This mtry limit was set slightly below the recipe's 10 predictors.

control_rf_parameters <- hardhat::extract_parameter_set_dials(control_rf_model) %>%
  update(mtry = mtry(range = c(1, 9)))

control_rf_grid <- grid_regular(control_rf_parameters, levels = 2)

Then, I ran my model by tuning and fitting, using my folded data and my grid.

control_rf_tune <- control_rf_workflow %>%
  tune_grid(resamples = control_train_folds,
            grid = control_rf_grid)

The last thing to do was to plot my tuned model.

autoplot(control_rf_tune)

As we can see from the above plot, when we increase the number of predictors, our accuracy and ROC AUC increase slightly. I hypothesize that this differs from the Hardpoint plot because more predictors are significant in Control.

Checking the accuracy of the final model:

control_rf_tuned_accuracy <- show_best(control_rf_tune, metric = "accuracy")
control_rf_tuned_accuracy
## # A tibble: 4 × 8
##    mtry min_n .metric  .estimator  mean     n std_err .config             
##   <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1     9    40 accuracy binary     0.749    50 0.00453 Preprocessor1_Model4
## 2     9     2 accuracy binary     0.742    50 0.00443 Preprocessor1_Model2
## 3     1     2 accuracy binary     0.740    50 0.00420 Preprocessor1_Model1
## 4     1    40 accuracy binary     0.736    50 0.00438 Preprocessor1_Model3
control_rf_tuned_accuracy[1,5]
## # A tibble: 1 × 1
##    mean
##   <dbl>
## 1 0.749

We had the highest accuracy of 0.7494743 with a minimum node size of 40 and an mtry of 9.


Model 3: Logistic Regression

First, I needed to set up my model. I set my engine to “glm” for logistic regression and set the mode to “classification.”

control_log_reg <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

Setting up the workflow with the model created in the last step, as well as the recipe that I created earlier.

control_log_wkflow <- workflow() %>% 
  add_model(control_log_reg) %>% 
  add_recipe(control_recipe)

Fit the model to the folded data:

control_log_fit <- fit_resamples(control_log_wkflow, control_train_folds)

Collecting metrics based on the folded data:

collect_metrics(control_log_fit)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.767    50 0.00504 Preprocessor1_Model1
## 2 roc_auc  binary     0.841    50 0.00467 Preprocessor1_Model1

Fitting the model to the full training set, not just the folds:

control_log_fit_train <- fit(control_log_wkflow, control2019_train)

Assessing model performance with the training data:

predict(control_log_fit_train, new_data = control2019_train, type = "prob")
## # A tibble: 2,122 × 2
##    .pred_0 .pred_1
##      <dbl>   <dbl>
##  1  0.0608   0.939
##  2  0.855    0.145
##  3  0.717    0.283
##  4  0.875    0.125
##  5  0.830    0.170
##  6  0.810    0.190
##  7  0.620    0.380
##  8  0.511    0.489
##  9  0.534    0.466
## 10  0.339    0.661
## # … with 2,112 more rows
augment(control_log_fit_train, new_data = control2019_train) %>%
  conf_mat(truth = win, estimate = .pred_class)
##           Truth
## Prediction   0   1
##          0 849 277
##          1 211 785
augment(control_log_fit_train, new_data = control2019_train) %>%
  conf_mat(truth = win, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

Checking accuracy with the training data:

control_log_reg_accuracy <- augment(control_log_fit_train, new_data = control2019_train) %>%
  accuracy(truth = win, estimate = .pred_class)
control_log_reg_accuracy
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.770

Model 4: K Nearest Neighbors

Setting up the model. I will be tuning “neighbors.”

control_knn_model <- 
  nearest_neighbor(
    neighbors = tune(),
    mode = "classification") %>% 
  set_engine("kknn")

Next, I set up the workflow.

control_knn_workflow <- workflow() %>% 
  add_model(control_knn_model) %>% 
  add_recipe(control_recipe)

I then set up the tuning grid.

control_knn_parameters <- hardhat::extract_parameter_set_dials(control_knn_model)
control_knn_grid <- grid_regular(control_knn_parameters, levels = 2)

Fitting and tuning my model:

control_knn_tune <- control_knn_workflow %>%
  tune_grid(resamples = control_train_folds,
            grid = control_knn_grid)

Plotting the model:

autoplot(control_knn_tune, metric = "accuracy")

This plot shows us that as the number of neighbors increases, there is also an increase in accuracy.

Testing the accuracy of the model:

control_knn_tuned_accuracy <- show_best(control_knn_tune, metric = "accuracy")
control_knn_tuned_accuracy
## # A tibble: 2 × 7
##   neighbors .metric  .estimator  mean     n std_err .config             
##       <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1        15 accuracy binary     0.735    50 0.00426 Preprocessor1_Model2
## 2         1 accuracy binary     0.679    50 0.00472 Preprocessor1_Model1

As we can see, the model performs best with 15 neighbors, yielding an accuracy of 0.7350554.


Comparing Model Performance

control_accuracies <- c(control_tuned_dt_accuracy$.estimate,
                    control_rf_tuned_accuracy[1,5],
                    control_log_reg_accuracy$.estimate, 
                    control_knn_tuned_accuracy[1, 4])
control_accuracies
## [[1]]
## [1] 0.82705
## 
## $mean
## [1] 0.7494743
## 
## [[3]]
## [1] 0.7700283
## 
## $mean
## [1] 0.7350554

As we can see from this, the decision tree appears to have the highest accuracy.
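Pulling the three game modes together (tuned decision-tree training accuracies copied from the outputs above), the decision tree led in every mode, and Control was the easiest mode to predict:

```r
# Best (decision tree) accuracy per game mode, values from the report above.
best_dt <- c(hardpoint = 0.7555618, search_and_destroy = 0.7416697,
             control = 0.8270500)
names(which.max(best_dt))  # "control"
```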